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Abstract 

Whole-genome sequencing of tumor tissue has the potential to provide comprehensive characterization of 
genomic alterations in tumor samples. We present Patchwork, a new bioinformatic tool for allele-specific copy 
number analysis using whole-genome sequencing data. Patchwork can be used to determine the copy number of 
homologous sequences throughout the genome, even in aneuploid samples with moderate sequence coverage 
and tumor cell content. No prior knowledge of average ploidy or tumor cell content is required. Patchwork is 
freely available as an R package, installable via R-Forge (http://patchwork.r-forge.r-project.org/). 
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Background 

Cancer is a disease in which somatic mutations lead to 
loss of proliferation control [1]. Genomic aberrations 
range from single-nucleotide mutations to copy number 
changes of sets of chromosomes, and can be recurrent 
in genomic regions, individual genes, and molecular 
pathways [2]. The number and complexity of genomic 
aberrations vary greatly between the different types of 
cancer. Recent large-scale studies have summarized the 
current knowledge in a genome-wide perspective [3-8]. 

Copy number aberrations affect both large and small 
portions of the genome. Methods such as spectral karyo- 
typing (SKY) and comparative genome hybridization 
have provided progressively more detailed information 
on copy number aberrations [9-11]. With the introduc- 
tion of high-density single-nucleotide polymorphism 
(SNP) arrays it is possible to obtain allele-specific infor- 
mation on a genome-wide scale [9,12]. Specialized soft- 
ware tools such as GAP (Genome Alteration Print), 
ASCAT (AUele-Specific Copy number Analysis of 
Tumors), and TAPS (Tumor Aberration Prediction 
Suite) were developed to use the allele-specific informa- 
tion to address issues such as aneuploidy and admixture 
of normal cells that complicate the analysis in tumor 
samples [13-15]. These tools provide allele-specific copy 
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number analysis (ASCNA), that is, analysis of the abso- 
lute number of each homologous copy. 

ASCNA can help identify the genotype of the ampli- 
fied or deleted copy, which may have a direct implica- 
tion on the tumor phenotype. Studies have shown that 
there may be preferential amplification of certain alleles 
in human tumors [16,17]. Perhaps more importantly, 
ASCNA helps interpret other somatic alterations, speci- 
fically point mutations. For example, if loss of hetero- 
zygosity (LOH) is detected in a region with a recessive 
mutation in a cancer-related gene, we can suspect a 
likely effect on tumor biology. ASCNA also facilitates 
reconstruction of the timing of mutational events 
through tumor development [2,18]. 

Recent advances in second-generation sequencing and 
data analysis are promoting whole-genome sequencing as 
an 'all-in-one' analysis for cancer genomes. Using whole- 
genome sequencing combined with bioinformatic tools it is 
possible to characterize an entire genome at base-pair reso- 
lution using a single molecular assay [19]. Several methods 
are available for copy number analysis of whole-genome 
sequencing data, but these do not provide absolute ASCNA 
[20,21]. Although tools that account for normal cell content 
have begun to emerge for whole-genome sequencing data 
[22], there is currently none that works without prior 
knowledge of the average ploidy. In this paper, we describe 
Patchwork, a tool for ASCNA of whole-genome sequencing 
data from tumor tissue. We found that performance was 
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comparable with array-based methods in terms of resolu- 
tion, sensitivity, and specificity, even with modest sequence 
coverage and thus this techniquie may obviate the need for 
copy number analysis based on SNP arrays. 

Results 

ASCNA with Patchwork is based on the same principles 
as TAPS, which was developed for SNP array data [15]. 
Quantitative information on total and allele-specific 
DNA content is obtained for genomic segments, and 
visualized in relation to all segments in the genome. 
The observed pattern is used to estimate absolute copy 
numbers and purity, and to determine input parameters 
for automatic calling of allele-specific copy numbers. 

Patchwork segments the genome based on total DNA 
content (normalized sequence coverage) using circular 
binary segmentation (CBS) [23]. For each segment, 
allele-specific information is used to estimate the rela- 
tive abundance of the two homologous copies. Unless 
sequenced in great depth, it is unfeasible to obtain such 
an estimate from the allelic read counts of single SNPs. 
The actual coverage at a SNP is affected not only by 
copy number, but by sequence bias and random sam- 
pling, and therefore varies greatly from average cover- 
age. However, along a segment containing many SNPs, a 
reliable measure of allelic imbalance can be achieved, 
even in samples with low coverage. In Patchwork, the 
allelic imbalance ratio of a genomic segment is calcu- 
lated as 

high low)/ ^ high, 

where Zlow and Zhigh are the number of reads with 
lower and higher observed allele counts summed over all 
heterozygous SNPs in the segment. Using sums of observed 
reads means that effectively each SNP is weighted accord- 
ing to its coverage, maximizing the use of the information. 
The allelic imbalance ratio and normalized coverage 
are plotted against each other for each genomic segment. 
Clusters of segments share allele-specific copy number. 
The allelic copy number compositions that emerge at dif- 
ferent coverage levels can be used to discern the absolute 
copy numbers and, thereby, the average ploidy of the 
tumor cells. Figure 1 illustrates the steps undertaken by 
Patchwork to process input data into interpretable figures 
and allele-specific copy numbers. For further details, please 
see the Materials and methods section. 

Patchwork can be used with any sequencing technol- 
ogy capable of producing SAM (Sequence alignment/ 
map) -formatted aligned reads, which includes the most 
common sequencers from Roche, lUumina, and Life 
Technologies. In addition, ASM (assembly)-formatted 
data from Complete Genomics can be used directly in a 
version of Patchwork called PatchworkCG (Figure 1). 



A patient-matched normal sample or a reference file 
based on diploid samples sequenced with a similar tech- 
nology is required. Reference files for Illumina/Solexa 
and Life Technologies/Solid are available with Patch- 
work. Users also have the option to build their own 
reference files from aligned reads obtained with their 
technology of choice. 

Performance validation using breast-cancer cell lines 

We applied Patchwork ASCNA to high-coverage whole- 
genome sequencing data from the breast-cancer cell line 
HCC1187 (approx. 60x coverage; Complete Genomics 
Inc., Mountain View, CA, USA). Performance was evalu- 
ated using a TAPS analysis of HCC1187 (Affymetrix SNP 
6.0; Affymetrix Inc., Santa Clara, CA, USA) as gold stan- 
dard. Both datasets indicated an average ploidy of 2.6, and 
matched a SKY karyotype of HCC1187 [24]. Patchwork 
performance was evaluated using all Patchwork-generated 
segments of at least 1 Mb that overlapped by at least 75% 
with any TAPS segments (see Materials and methods for 
further details). The result of the performance evaluation 
is displayed in Table 1. Patchwork detected allele-specific 
copy numbers with a sensitivity of 93 to 100% and a speci- 
ficity of 99 to 100%. 

To evaluate the performance of Patchwork under 
more challenging conditions, we used whole-genome 
sequencing data from the breast-cancer cell line 
HCC1954 (approx. 4x coverage; Illumina GAII; Illumina 
Inc., San Diego, CA, USA) and patient-matched cell line 
HCC1954BL with normal karyotype (approx. 5x cover- 
age; Illumina GAII). Sequencing reads were mixed from 
the two samples to resemble the effects of varying 
tumor cell content. A TAPS analysis of HCC1954 (Affy- 
metrix SNP 6.0) was used as the gold standard. The 
estimated copy numbers closely resembled a publicly 
available SKY karyotype of HCC1954 [24], but there 
were some discrepancies between the sequencing, 
microarray, and SKY data. These discrepancies were 
whole chromosomes or chromosome arms that differed 
in copy number. The array, sequencing, and SKY data 
came from different sources and DNA extractions, and 
such differences can most likely be explained by gain or 
loss of chromosomes during culture. Cancer genomes 
are not necessarily stable during culture, and genomic 
alteration and subclones in the cell populations are fre- 
quently seen [15,18,25]. These chromosomes were 
excluded from the evaluation (see Materials and meth- 
ods; see Additional file 1). 

Sequence reads from HCC1954BL were used to dilute 
HCC1954 reads in silico into samples representing 70%, 
50%, and 30% tumor cell content, and sensitivity and spe- 
cificity were assessed for different allele-specific copy 
numbers (Figure 2A,B; for the numbers of matching seg- 
ments, see Additional file 1). Some allele-specific copy 
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Figure 1 Patchwork flowchart. 1) Sequenced reads are aligned to the genome. 2) Single-nucleotide (and optionally short indel) variants that 
do not match the reference genome are extracted. 3) Systematic bias is removed by normalizing for GC content and other position-related 
effects. Coverage information from one or more diploid samples sequenced with the same method is used in this process. 4) The genome is 
segmented, based on the normalized coverage. 5) Informative heterozygous variants are identified. 6) Allelic imbalance ratio is calculated for 
each segment. 7) Visualization of allelic imbalance ratio and normalized coverage for genomic segments. 8) Manual interpretation of visualization 
to obtain input parameters for next step. 9) Allele-specific copy number is calculated for all genomic segments. Steps 3 to 7 and 9 are handled 
by the Patchwork.plot() and PatchworkcopynumbersO modules respectively. 



numbers such as 3ml and 4ml were detected with both 
sensitivity and specificity above 92% for tumor propor- 
tions down to 50%; despite the low sequence coverage, 
this resembled performance with 60x coverage. For other 
copy numbers, performance was lower than expected. A 
closer examination of the Patchwork scatter plots 



indicated that short segments of relatively high copy 
number appeared outside the expected clusters. This may 
have been due to subclones in the cell population. 
Another explanation may be that 4x coverage of the can- 
cer genome is just barely enough for Patchwork to sepa- 
rate these allele-specific copy numbers. Downsampling of 
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Table 1 Detection of allele-specific copy numbers in cell line HCC1187 using Patchwork, with (Affymetrix SNP 6 

Total copies, n Minor alleles, n True positives, n True negatives, n False positives, n False negatives, n Sensitivity, % Specificity, % 



1 0 16 679 

2 0 480 208 
1 7 688 

3 0 39 649 

1 46 646 

4 1 10 684 

2 69 618 



^Number of segments, sensitivity and specificity are shown. 

HCC1954 to 2x coverage indicated additional perfor- 
mance loss on all copy numbers (data not shown). 

Patchwork analysis of a breast-cancer primary tumor and 
metastasis 

We applied Patchwork to whole-genome sequencing data 
from a breast-cancer primary tumor, metastasis and 
xenograft (approx. 30x coverage; lUumina GAII) from a 
published study [26]; results of Patchwork visualizations 
for the primary tumor and the metastasis are shown in 
Figure 3. In the top panel, segments from the entire gen- 
ome seem to be closer to each other in the primary 
tumor (Figure 3A), which indicates a lower proportion of 
tumor cells than in the metastasis sample (Figure 3B). A 
description of how normalized coverage is affected by the 
proportion of tumor cells can be found in the Materials 
and methods section. 

We found that 97% of the genome (base pairs) 
matched in terms of total copy gain, total copy loss, and 
unchanged copy number between the primary tumor 
and metastasis, indicating a high similarity (see Addi- 
tional file 2). The average ploidy was almost 3.5 for both 



0 0 100.0 100.0 

0 7 98.6 100.0 

0 0 100.0 100.0 
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2 1 97.9 99.7 

1 0 100.0 99.9 

3 5 93.2 99.5 



samples, and allele-specific copy numbers were mostly 
identical throughout the genome. One exception was 
that most of chromosome 16q had three copies in the 
primary tumor and four copies in the metastasis, with 
retained heterozygosity in both samples (Figure 3AB). 
The xenograft displayed very variable sequence cover- 
age, likely due to contamination by mouse DNA. We 
used the Patchwork figures to visually compare the sam- 
ples, and found no copy number differences between the 
xenograft and the primary tumor. This is further illu- 
strated in whole-genome copy number profiles in Addi- 
tional file 2. 

Our analysis indicated systematically higher copy 
numbers than the originally published copy number 
analysis by Ding et al, [26], which may be due to the 
original analysis not taking the true average ploidy 
(approx. 3.5) into account, and thus underestimating the 
copy numbers. In addition, the findings by Ding et al. of 
more copy number alterations in the metastasis may be 
due to a lower sensitivity of detection in the primary 
tumor, which seems to have lower tumor cell content 
(Figure 3A,B; see Additional file 2). 




Figure 2 Patchwork performance on /n s/7/co dilutions of cancer and matched normal cell lines. TAPS (Tumor Aberration Prediction Suite) 
analysis of tine breast-cancer cell line HCC1954 (nearly tetraploid) was used as a gold standard for the evaluation of Patchwork performance on 
a slightly different culture from the same cell line. Reads from the patient-matched blood cell line HCC1954BL were added to the data to 
estimate the effect of a reduced proportion of tumor cells. Data from the pure cancer cell line and mixtures corresponding to 70%, 50%, and 
30% tumor cells were analyzed. (A) Sensitivity and (B) specificity for allele-specific copy number calls. 
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Figure 3 Allele-specific copy number analysis of a breast-cancer primary tumor and metastasis. (A,B) Patchwork analysis of (A) chromosome 
16 of a breast-cancer primary tumor, and (B) of the same chromosome of a patient-matched metastasis. Top panels: Patchwork plot for genomic 
segments; middle panels: total and minor copy number for chromosome 16; bottom panel: normalized coverage for chromosome 16. With a few 
short deviations, the entire chromosome 16 displays heterozygous copy number 3 in the primary tumor sample. The copy number profile is nearly 
identical for the metastasis sample, except that most of the q-arm has yet another copy of the amplified chromosome. 



A detailed view of chromothripsis 

Chromothripsis is a catastrophic rearrangement of chromo- 
somal regions that includes deletions and amplifications 
[27] . It has been associated with formation of fusion genes 
and double minute chromosomes. Typically, only one of the 
two homologous chromosomes is affected by the phenom- 
enon, leaving the other(s) intact. We used Patchwork to ana- 
lyze data from the breast-cancer cell line HCC2218 (approx. 
60x coverage; Complete Genomics) and found an example 
of chromothripsis on chromosome 17. All segments of chro- 
mosome 17 are shown in Figure 4. On a large portion of the 
q-arm, the highly variable normalized coverage indicates 
non-integer total copy numbers. ASCNA (bottom panel) 
indicates that the minor copy number remains at 1. This 
pattern matches those described for chromothripsis, and is 
consistent with the theory of fragmentation of one of the 
homologs, followed by aberrant multiplication, loss, and 
reassembly of the fragments [27] . Genes in the region may 
have been damaged, lost or amplified, coupled with a differ- 
ent promoter, or fused with another gene. Analysis of copy 
numbers (which may be heterogeneous and therefore appear 
non-integer) is an important part of understanding the can- 
cer genome in these highly aberrant regions. 

Discussion 

The importance of copy number analysis 

Characterization of cancer genomes benefit from 
ASCNA in three major ways. 1) ASCNA provides an 



accurate measure of total copy number in cases of aneu- 
ploidy. Finding the correct copy numbers rather than 
calling gain or loss relative to the average coverage, as is 
commonly done, may avoid false discovery of homozy- 
gous loss. 2) ASCNA reveals LOH, which indicates 
whether the tumor cells retain a constitutive allele that 
may render recessive mutations inconsequential. 3) 
ASCNA facilitates the identification and analysis of shat- 
tered chromosomes (chromothripsis), which is being 
recognized as an important type of genomic aberration 
and may be associated with poor prognosis. 

We believe that allele-specific copy numbers and normal 
cell content should be a part of the input information for 
analysis of events such as translocation breakpoints, point 
mutations, and short insertions and deletions. Because 
allele-specific copy numbers reflect the composition of the 
homologous copies along the genome, they can be used to 
reconstruct the set of events leading to formation of the 
observed cancer genome [18]. Restricting the analysis to 
total copy number and LOH may limit our understanding 
of the molecular genetic events that have taken place in a 
cancer genome. 

Limitations 

Patchwork provides information on total copy number 
and the number of copies of the minor allele, but does not 
assign copy number to specific SNP alleles, which may be 
desired, as certain alleles may be preferentially amplified in 
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Figure 4 Detailed visualization of chromothripsis in a breast-cancer cell line. On chromosome 17 of the breast-cancer cell line HCC2218, 
total copy number varies on the q-arm whereas the minor copy number stays at 1. This is consistent with one of the homologs having 
undergone chromothripsis. Top panel: Scatter plot of allelic imbalance ratio and normalized coverage for segments on chromosome 17, colored 
in blue-red gradient. Grey segments are those located on other chromosomes. Bottom panel: Total copy number in blue-red gradient 
depending on chromosomal location. Grey indicates minor copy number. 



some tumors. Specialized methods are available for this 
purpose, joining copy number and genotyping analysis. 
One such tool is HATS (Haplotype Amplification in 
Tumor Sequences), developed for identification of ampli- 
fied alleles using haplotype information [28]. We suggest 
using Patchwork to identify allele-specific copy numbers, 
and HATS to identify individual higher-copy SNP alleles 
in regions where the original homologs have unequal copy 
number. It should be pointed out that HATS is not 
designed to identif)^ which haplotype is the background for 
somatic mutations. Reads covering both the somatic muta- 
tion and a second polymorphic site can be used for that 
purpose [29]. 



Patchwork is designed for whole-genome sequencing 
data. Although most aspects of Patchwork would also be 
viable for whole exome sequencing, such data are differ- 
ent in some important respects. Exome sequencing relies 
on enrichment strategies that may cause saturation 
effects and require different normalization. Sequencing of 
such a small portion of the genome reduces the number 
of informative heterozygous markers and requires a dif- 
ferent segmentation solution. Other tools have been 
developed specifically for detection of copy number aber- 
rations and LOH in cancer from exome sequencing data 
[30,31]; however, they do not provide ASCNA nor take 
aneuploidy into account. 
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Copy number analysis is still usually performed using 
SNP arrays because of the lower costs and DNA require- 
ments, easier data handling, and mature analysis tools. 
With Patchwork, we have taken an analysis strategy ori- 
ginally conceived for SNP arrays, and transformed it into a 
tool that extracts similar data from whole-genome sequen- 
cing. After normalization and SNP identification, the ana- 
lysis strongly resembles that of array data. Within the 
sample, a relative change in signal intensity (normalized 
for sequence or hybridization bias) represents a change in 
copy number. Whereas microarrays are subject to hybridi- 
zation effects such as saturation (limiting sensitivity at 
high copy numbers), normalized sequence-read coverage 
is proportional to the copy number of the original cells. 
Another potential advantage with sequencing is that 
paired-end assays and/or local reassembly of reads can be 
used to map breakpoints in greater detail than with micro- 
arrays. We expect that future versions of Patchwork will 
be able to use such information to complement CBS and 
generate a much more detailed characterization of the 
cancer genome than is currently possible with SNP arrays. 

The Patchwork software 

The Patchwork website (http://patchwork.r-forge.r-project. 
org/) has documentation and links to available R packages, 
installable via R-forge [32]. Currently two versions are avail- 
able, one for use on BAM (Binary sequence alignment/ 
map) -formatted data and one for use on ASM-formatted 
(Complete Genomics) data. Detailed instructions, including 
examples and tutorials are also available. Patchwork runs 
on desktop computers. 

Conclusions 

Many studies have shown that analysis of copy numbers 
and LOH is an important part of genome characterization 
in cancer, and that DNA microarrays are suitable for the 
task. Bioinformatic tools capable of ASCNA of cancer gen- 
omes have been available for SNP array data for some 
time, but tools for whole-genome sequencing data have 
lagged behind. With Patchwork, we have developed a tool 
with which whole-genome sequencing, even at modest 
sequence coverage, can be used for ASCNA of cancer 
genomes. 

Materials and methods 

Patchwork data input 

Patchwork takes BAM-formatted aligned reads as input, 
which is the standard output from most short-read 
aligners. ASM-formatted data (Complete Genomics) are 
also supported, and other formats may be added in the 
near future. Single-nucleotide variant data (for allele-spe- 
cific quantification) is extracted using SAMtools [33], 
and discovered variants are filtered using a list of known 
SNPs (dbSNP) [34]. If a patient-matched normal sample 



is available, it is used to improve the ability of Patchwork 
to identify constitutive heterozygous SNPs, which are 
informative for allele-specific analysis. 

GC content normalization 

BAM formatted data are divided into short (200 bp) 
windows, which are normalized for GC content bias. 
The normalization process groups the windows based 
on GC content (extracted from, in this case, the human 
genome assembly hgl9) and normalizes each group 
based on the read count of each window relative to the 
group average. This strategy resembles what is used in 
other methods and is extremely effective because GC 
content tends to correlate non-linearly with sequence 
coverage, and differs depending on the sequencing plat- 
form and library preparation [22]. 

Positional normalization 

For normalization of unknown positional bias. Patchwork 
uses either a patient-matched normal sample or a refer- 
ence file based on diploid samples sequenced with the 
same sequencing protocol. The reference data are normal- 
ized for GC content as described above, and in case of sev- 
eral reference samples, averaged for each 200 bp window. 
Reference files are provided for lUumina/Solexa and Life 
Technologies/Solid data, and can easily be prepared for 
other types of data. 

Smoothing and segmentation 

Normalized coverage, relative to the reference, is sum- 
marized in 10 kb, fixed-window bins along each chro- 
mosome arm. Chromosome arms are then individually 
segmented using circular binary segmentation (the 
DNAcopy package) [35]. Each segment is assigned aver- 
age normalized coverage and allelic imbalance ratio, 
which is 

high low)/ ^ high, 

where Zlow and Zhigh are the number of reads with 
lower and higher observed allele counts, summed for all 
heterozygous SNPs in the segment. 

Copy number visualization and analysis 

Patchwork generates color-coded figures for each chromo- 
some, with a gradient from blue on the distal p-arm to red 
on the distal q-arm. These figures form the primary result, 
and allow the analyst to interpret the sample in terms of 
average ploidy, coverage and copy number relationship, 
LOH, tumor cell content, and tumor cell heterogeneity. 
Ploidy can be determined from the cluster pattern, with 
one possible cluster for copy number 1 (ImO), two possi- 
ble clusters for copy number 2 (2ml and 2m0), and so on. 
An automated copy number calling method similar to that 
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of TAPS is also available. It requires an initial interpreta- 
tion of the figures (currently the approximate coverage dif- 
ference of a single copy, and the alleUc imbalance ratios 
corresponding to copy number 2 with and without LOH). 
The algorithm assigns allele-specific copy number to 
genomic segments, based on the initial interpretation and 
knowledge of the figure patterns. 

The copy number analysis algorithm also calculates the 
average ploidy and purity of the tumor cells. The average 
ploidy, Ploidyxumj is the average total copy number of all 
genomic segments weighted by segment length. The purity 
estimate is based on Aobs> the observed difference in nor- 
malized coverage, corresponding to one copy in the tumor 
cells, and Agxp, which is the difference that can be expected 
in a sample with 100% pure tumor cells. Agxp is obtained 
from Ploidyxum (equation 1). For example, a triploid sam- 
ple has three copies at the average normalized coverage, 
and loss or gain of one copy in a particular region would 
be expected to alter the normalized coverage by one-third. 
Copy number alterations in a fraction of the DNA would 
further reduce the effect by that fraction. Thus, the tumor 
DNA content DNAfracxum can be expressed in terms of 
Aobs and Agxp, (equation 2). DNAfracxum is also a function 
of tumor cell content (purity) CellfracTum and the ploidy of 
the tumor and normal cells Ploidyxum and PloidyNorm 
(equation 3). Patchwork assumes that PloidyNorm = 2 and 
CellfracNorm = 1-CellfracTum and calculates CellfracTum 
accordingly (equation 4). 



1 



^exp 
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Turn 



DNAfracTum 



Aobs 
Aexp 
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assembly from UCSC using Bowtie [36] . Sequenced reads 
from a breast-cancer primary tumor, matched non-tumor 
tissue, metastasis and xenograft, originally published by 
Ding et al [26] were obtained from dbGAP [phs000245.vl. 
pi] and aligned using Bowtie. ASM-formatted sequence 
data from HCC1187 and HCC2218 (assembly software ver- 
sion 2.0.0.32) were obtained from Complete Genomics 
[37,38]. 

Performance evaluations 

The sequenced reads from the HCC1954 cancer cell line 
were diluted by adding reads from the patient-matched 
blood cell line HCC1954BL using a random-number 
generator. Reads were selected with a probability based 
on HCC1954 total coverage, HCC1954 average ploidy 
(nearly tetraploid), and the desired tumor cell content. 
The diluted samples and the ASM-formatted sequence 
data from Complete Genomics were analyzed with 
Patchwork. 

For HCC1187 and HCC1954, sensitivity and specificity 
were calculated for each allele-specific copy number by 
comparing the Patchwork results, with the TAPS (micro- 
array) gold standard. Patchwork-generated segments larger 
than 1 Mb with at least 75% overlap with the microarray 
data were used. Exact total and minor copy number 
matching was required. Sensitivity was calculated as true 
positives/ (true positives + false negatives) and specificity as 
true negatives /(true negatives + false positives). Chromo- 
somes 5, 8, 13, 15, and 17 were excluded from the analysis 
of the HCC1954 cell line (see Results section; see Addi- 
tional file 1 Supplemental data). Performance results for 
the most abundant copy number compositions (>15 seg- 
ments) were used for Figure 2. The accuracy of the TAPS 
analysis was confirmed using publicly available SKY karyo- 
types [24]. 

Breast-cancer tissue samples 

Similarity of Patchwork results from the breast-cancer 
primary tumor and metastasis samples was confirmed 
by matching average copy number and gain, loss, or 
unchanged copy number along the genome at base-pair 
resolution. 



Data acquisition and processing 

Microarray data (Affymetrix SNP6) for HCC1954 and 
HCC1187 were acquired from GEO [GEO:GSE13372; 
GSE36138], preprocessed in Nexus Copy Number (ver- 
sion 5.0) and analyzed for allele-specific copy number 
using TAPS. SKY karyotypes of HCC1187 and 
HCC1954 were acquired from the University of Cam- 
bridge [24]. 

Sequence data from HCC1954/HCC1954BL originally 
published by Chiang et al, [21], was obtained from SRA 
[SRA:SRA001246] and aligned to the 'hgl9' human genome 



Breast-cancer tissue dataset 

Funding support for the breast-cancer primary tumor, 
metastasis and xenograft sequence data was provided by 
grants from Washington University in St. Louis and the 
National Human Genome Research Institute (NHGRI 
U54 HG003079), the National Cancer Institute (NCI 1 
UOl CAl 14722-01), the Susan G Komen Breast Cancer 
Foundation (BCTR0707808), and the Fashion Footwear 
Charitable Foundation, Inc. NCI UIO CA076001. Breast 
Cancer Research Foundation grant awarded to the Amer- 
ican College of Surgeons Oncology Group supported the 
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acquisition of samples for recurrence testing. The tissue 
procurement core was supported by an NCI core grant 
(NCI 3P50 CA68438). The Human and Mouse Linked 
Evaluation of Tumors Core was supported by the Insti- 
tute of Clinical and Translational Sciences at Washington 
University (CTSA grant ULl RR024992). Illumina, Inc. 
and Washington University also supported this dataset 
through the Washington University Cancer Genome 
Initiative. 

Additional material 



Additional File 1: Comparison of TAPS (Tumor Aberration 
Prediction Suite) and Patchwork analyses of the breast-cancer cell 
line HCC1954. 

Additional File 2: Patchwork copy number profiles of breast-cancer 
primary tumor, metastasis, and xenograft, based on sequence data 
originally published by Ding et al. [26], 
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